Effective GPU Strategies for LU Decomposition
Authors
Abstract
GPUs are becoming an attractive computing platform not only for traditional graphics computation but also for general-purpose computation, owing to the computational power, programmability, and comparatively low cost of modern GPUs. This has led to a variety of complex GPGPU applications with significant performance improvements. LU decomposition is a fundamental step in many computationally intensive scientific applications, and it is often the most costly step in the solution process because its cost grows with the size of the matrix. In this paper we implement three different variants of the LU decomposition algorithm on a Tesla C1060. The variant found to best fit the highly parallel architecture of modern GPUs is the Update through Column implementation with shared memory access.
Keywords: LU decomposition, CUDA, GPGPU
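The abstract does not spell out the three variants, so the following is only a minimal sketch of a generic column-by-column (right-looking) LU factorization, not the paper's implementation. It is written in plain Python rather than CUDA, performs no pivoting, and the function name `lu_in_place` is hypothetical. It is meant to show where the parallelism comes from: for a fixed column `k`, every entry of the trailing submatrix is updated independently, which is the work a GPU kernel would typically spread across threads.

```python
def lu_in_place(a):
    """Factor a square matrix (list of lists of floats) as A = L * U.

    Assumption: no pivoting, so every pivot must be nonzero.
    On return, the strict lower triangle of `a` holds L (its unit
    diagonal is implicit) and the upper triangle holds U.
    """
    n = len(a)
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # Scale column k below the diagonal: these are the multipliers,
        # i.e. column k of L.
        for i in range(k + 1, n):
            a[i][k] /= pivot
        # Update the trailing submatrix one column at a time. Each (i, j)
        # update is independent of the others for this k, so on a GPU
        # each could be assigned to its own thread.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


if __name__ == "__main__":
    m = [[4.0, 3.0], [6.0, 3.0]]
    print(lu_in_place(m))
```

In a CUDA version one would presumably stage the pivot row and the multiplier column in shared memory before the trailing update, since every thread reads them; that detail is an assumption here and is not taken from the paper.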
Similar resources
Parallel Triangular Solvers on GPU
In this paper, we systematically investigate GPU-based parallel triangular solvers. Parallel triangular solvers are fundamental to the incomplete LU factorization family of preconditioners and to algebraic multigrid solvers. We develop a new matrix format suitable for GPU devices. Parallel lower triangular solvers and upper triangular solvers are developed for this new data structure. With these solv...
Randomized LU Decomposition Using Sparse Projections
A fast algorithm for the approximation of a low-rank LU decomposition is presented. To achieve low complexity, the algorithm uses sparse random projections combined with FFT-based random projections. The asymptotic approximation error of the algorithm is analyzed and a theoretical error bound is presented. Finally, numerical examples illustrate that for a similar approximation error, ...
Parallelization of the LU Decomposition on Heterogeneous Systems
With the appearance of GPUs as valid platforms not only for graphics computation but also for general-purpose computation, applications that exploit hybrid/heterogeneous systems can be made available to the mass market thanks to the widespread availability of these systems. Correct distribution of the workload of these applications can lead to significant performance boosts to complex applicati...
Automatically Tuned Dense Linear Algebra for Multicore+GPU
The Multicore+GPU architecture has been adopted in some of the fastest supercomputers listed on the TOP500. The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures such as Multicore+GPU. However, to provide portable performance, manual parameter tuning is required. This paper presents automatically tuned LU factorizat...
Locality Optimization on a NUMA Architecture for Hybrid LU Factorization
We study the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We apply these placement strategies ...